Abstract

The application of semantic technologies to the integration of biological data and the interoperability of 
bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons 
hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the 
BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies
in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description 
Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata 
for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and 
the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by 
these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from 
these activities are discussed. 

Keywords: BioHackathon, Bioinformatics, Semantic Web, Web services, Ontology, Visualization, Knowledge
representation, Databases, Semantic interoperability, Data models, Data sharing, Data integration
Introduction

In life sciences, the Semantic Web is an enabling technol- 
ogy which could significantly improve the quality and ef- 
fectiveness of the integration of heterogeneous biomedical 
resources. The first wave of life science Semantic Web 
publishing focused on availability - exposing data as RDF 
without significant consideration for the quality of the 
data or the adequacy or accuracy of the RDF model used. 
This allowed a proliferation of proof-of-concept projects 
that highlighted the potential of Semantic technologies. 
However, now that we are entering a phase of adoption of 
Semantic Web technologies in research, quality of data 
publication must become a serious consideration. This is 
a prerequisite for the development of translational re- 
search and for achieving ambitious goals such as personal- 
ized medicine. 

While Semantic technologies, in and of themselves, do 
not fully solve the interoperability and integration problem,
they provide a framework within which interoperability is 
dramatically facilitated by requiring fewer pre-coordinated 
agreements between participants and enabling unantici- 
pated post hoc integration of their resources. Nevertheless, 
certain choices must be made, in a harmonized manner, to 
maximize interoperability. The yearly BioHackathon series 
[1-3] of events attempts to provide the environment within 
which these choices can be explored, evaluated, and then 
implemented on a collaborative and community-guided 
basis. These BioHackathons were hosted by the National 
Bioscience Database Center (NBDC) [4] and the Database 
Center for Life Science (DBCLS) [5] as a part of the Inte- 
grated Database Project to integrate life science databases 
in Japan. In order to take advantage of the latest technolo- 
gies for the integration of heterogeneous life science data, 
researchers and developers from around the world were 
invited to these hackathons. 



This paper contains an overview of the activities and 
outcomes of two highly interrelated BioHackathon events 
which took place in 2011 [6] and 2012 [7]. The themes of 
these two events focused on representation, publication, 
and exploration of bioinformatics data and tools using 
standards and guidelines set out by the Linked Data and 
Semantic Web initiatives. 

Review 

Semantic Web technologies are formalized as World
Wide Web Consortium (W3C) standards aimed at creating
general-purpose, long-lived data representation, exchange,
and integration formats that replace current ad
hoc solutions. However, because they are general-purpose
standards, many issues need to be addressed and agreed- 
upon by the community in order to apply them success- 
fully to the integration and interoperability problems of 
the life science domain. Therefore, participants of the 
BioHackathons fall into sub-groups of interest within 
the life sciences, representing the specific needs and 
strengths of their individual communities within the 
broader context of life science informatics. Though 
there were multiple specific activity groups under each 
of the following headings, and there was overlap and 
cross-talk between the activities of each group, we will 
organize this review under the five general categories of
RDF data, Ontology, Metadata, Platforms and Applications
(Figure 1). Results and issues raised by each group
are briefly summarized in Table 1. We also note that
many groups have published or will publish their respective
outcomes in individual publications.

RDF data 

In terms of RDF data generation, data were generated 
for genomic and glycomic databases (domain-specific 



[Figure 1: diagram of five category boxes and their topics.
RDF data: domain specific models (genome data, glycome data); text processing (text extraction from PDF, named entity recognition, natural language queries).
Ontology: IRI mapping and normalization; environmental ontologies; lexical resources; enzyme reaction equations.
Metadata: service quality indicators; data content descriptors.
Infrastructure: RDFization tools; triple stores.
Applications: Semantic Web visualization; ontology mapping visualization; identifier conversion service; semantic query via voice recognition.]

Figure 1 Overview of categories and topics raised during the BioHackathons of 2011 and 2012. Lines between the boxes represent
semantic relationships between categories.
Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012

RDF data

Domain specific models

Genome and proteome data
Issue: No standard RDF data model and tools existed for major genomic data
Result: Created FALDO, INSDC, GFF, GVF ontologies and developed converters
Software: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service

Glycome data
Issue: Glycome and proteome databases are not effectively linked
Result: Developed a standard RDF representation for carbohydrate structures by BCSDB, GlycomeDB, GLYCOSCIENCES.de,
JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developers
Software: RDFized data from these databases, stored them in Virtuoso and tested SPARQL queries among the different data
resources

Text processing

Text extraction from PDF and metadata retrieval
Issue: Text for mining is often buried in the PDF formatted literature and requires preprocessing
Result: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs
Software: Used PDFX for text extraction; retrieved metadata by the TogoDoc service

Named entity recognition and RDF generation
Issue: No standard existed for combining the results of various NER tools
Result: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data
Software: Extended SIO ontology for NER and newly developed the BioInterchange tool for RDF generation

Natural language query conversion to SPARQL
Issue: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human friendly interface
Result: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis
Software: Improved the in-house LODQA system; used ontologies from BioPortal

Ontology

IRI mapping and normalization
Issue: IRIs for entities automatically generated by BioPortal do not always match with submitted RDF-based ontologies
Result: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the Identifiers.org IRI, or the Bio2RDF IRI
Software: Used services of BioPortal, the MIRIAM registry, Identifiers.org and Bio2RDF

Environmental ontologies for metagenomics
Issue: Semantically controlled description of a sample's original environment is needed in the domain of metagenomics
Result: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project
Software: References the Environment Ontology (EnvO) and other ontologies

Lexical resources
Issue: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF
data
Result: Developed an ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint
Software: Data provided by the Life Science Dictionary (LSD) project

Enzyme reaction equations
Issue: A new ontology must be developed to represent incomplete enzyme reactions which are not supported by IUBMB
Result: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns
Software: Obtained data from the KEGG database and the result is available at GenomeNet

Metadata

Service quality indicators
Issue: Quality of the published datasets (SPARQL endpoints) is not clearly measured
Result: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints
Software: Web site is under development to illustrate the summary of periodical measurements

Database content descriptors
Issue: Uniform description of the core attributes of biological databases should be semantically described
Result: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval
Software: Evaluated identifiers for DBs in NAR, DBpedia, Identifiers.org and ORCID and vocabularies from Biositemaps, EDAM, BRO
and OBI

Generic metadata for dataset description
Issue: Database catalogue metadata needs to be machine-readable for enabling automatic discovery
Result: Conventions to describe the nature and availability of datasets will be formalized as a community agreement
Software: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt,
Bio2RDF, Biogateway, Open PHACTS, EURECA, and Identifiers.org continue the discussion in teleconferences

Platforms

RDFization tools
Issue: RDF generation tools supporting various data formats and data sources are not yet sufficient
Result: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed
Software: BioInterchange can be used as a tool, Web services and libraries; bio-table is a generic tool for tabular data

Triple stores
Issue: Survey is needed to test scalability of distributed/cluster-based triple stores for multi-resource integration
Result: Hadoop-based and Cluster-based triple stores were still immature and federated queries on OWLIM-SE were still inefficient
Software: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for Cluster-based triple stores

Applications

Semantic Web exploration and visualization
Issue: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries
Result: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects,
and specificity to life sciences use cases
Software: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future

Ontology mapping visualization
Issue: Visualization of ontology mapping is required to understand how different ontologies with relating concepts are
interconnected
Result: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were
visualized
Software: Applicability of Google Fusion Tables and Gephi was investigated

Identifier conversion service
Issue: Multiple synonyms for the same data inhibit cross-resource querying and data mining
Result: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and
visualize the result
Software: G-Links resolves and retrieves all corresponding resource URIs

Semantic query via voice recognition
Issue: Intuitive search interface similar to "Siri for biologists" would be useful
Result: Developed a context-aware virtual research assistant Genie which recognizes spoken English and replies in a synthesized
voice
Software: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie



models) and from the literature using text processing
technologies. We describe these two subcategories here.

Domain specific models

Genome and proteome data Due to the high-throughput
generation of genomic data, it is of high priority to
generate RDF models for both nucleotide sequence annotations
and amino acid sequence annotations. Up to
now, nucleotide sequence annotations are provided in a 
variety of formats such as the International Nucleotide 
Sequence Database Collaboration (INSDC) [8], Generic 
Feature Format (GFF) [9] and Genome Variation Format 
(GVF) [10]. By RDFizing this information, all of the anno- 
tations from various sequencing projects can be integrated 
in a straightforward manner. This would in turn accom- 
modate the data integration requirements of the H-InvDB 
[11]. In general, due to the large variety of genomic anno- 
tations possible, it was decided that in the first iteration of 
a genomic RDF model, opaque Universally Unique IDenti- 
fiers (UUIDs) are to be used to represent sequence fea- 
tures. Each UUID would then be typed with its appropriate 
ontology, such as Sequence Ontology (SO), and sequence 
location would be specified using Feature Annotation Lo- 
cation Description Ontology (FALDO) [12,13]. FALDO 
was newly developed at the BioHackathon 2012 by repre- 
sentatives of UniProt [14], DDBJ [15] and genome scien- 
tists for the purpose of generically locating regions on the 
biological sequences (e.g., modification sites on a protein 
sequence, fuzzy promoter locations on a DNA sequence 
etc.). A locally-defined vocabulary was used to annotate 
other aspects such as sequence version and synonymy. 
Thus, a generic system for nucleotide and amino acid sequence
annotations could be proposed. Converters that output
compatible RDF documents were also developed,
including HMMER3 [16], GenBank/DDBJ [17], GTF
[18] and GFF2OWL [19]. The RDF output for Proteomics
Standard Initiative Common QUery InterfaCe (PSICQUIC)
[20], a tool to retrieve molecular interaction data from
multiple repositories with more than 150 million interactions
available at the time of writing, was modified during
the BioHackathon 2011 to improve the mapping of identifiers
and ontologies. Identifiers.org was chosen as the provider
for the new IRIs for the interacting proteins and
ontology terms to allow a better integration with other
sources. PSICQUIC RDF output is based on the popular
BioPAX format [21] for interactions and pathways.
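
The modelling decision described above (an opaque UUID per feature, typed with a Sequence Ontology class and located with FALDO) can be illustrated with a minimal sketch. The FALDO namespace and terms (location, Region, begin, end, ExactPosition, position, reference) are taken from the published ontology; the function name, the reference sequence IRI and the coordinates are hypothetical, and the code is not one of the converters developed at the BioHackathon.

```python
import uuid

FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
SO_GENE = "http://purl.obolibrary.org/obo/SO_0000704"  # Sequence Ontology class "gene"


def feature_triples(so_class, reference, begin, end, feature_id=None):
    """Return N-Triples lines for one sequence feature (illustrative only)."""
    feature_id = feature_id or uuid.uuid4()
    key = uuid.UUID(str(feature_id)).hex  # used to keep blank node labels unique
    feature = f"<urn:uuid:{feature_id}>"  # opaque identifier for the feature
    region = f"_:region{key}"
    lines = [
        f"{feature} <{RDF_TYPE}> <{so_class}> .",       # type the UUID with an ontology class
        f"{feature} <{FALDO}location> {region} .",      # attach a FALDO location
        f"{region} <{RDF_TYPE}> <{FALDO}Region> .",
    ]
    for name, coordinate in (("begin", begin), ("end", end)):
        position = f"_:{name}{key}"
        lines += [
            f"{region} <{FALDO}{name}> {position} .",
            f"{position} <{RDF_TYPE}> <{FALDO}ExactPosition> .",
            f'{position} <{FALDO}position> "{coordinate}"^^<{XSD_INTEGER}> .',
            f"{position} <{FALDO}reference> <{reference}> .",
        ]
    return lines


if __name__ == "__main__":
    # Hypothetical reference sequence IRI and coordinates.
    for line in feature_triples(SO_GENE, "http://example.org/sequence/chr1", 100, 200):
        print(line)
```

Because the feature identifier carries no meaning, all semantics sit in the type and location triples, which is what allows annotations from different sequencing projects to be merged without coordinating identifiers in advance.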

Glycome data The Glycomics working group consisted 
of developers from the major glycomics databases includ- 
ing Bacterial Carbohydrate Structure Database (BCSDB) 
[22], GlycomeDB [23,24], GLYCOSCIENCES.de [25], Japan 
Consortium for Glycobiology and Glycotechnology Data- 
base (JCGGDB) [26], MonosaccharideDB [27], Resource 
for INformatics of Glycomes at Soka (RINGS) [28], and 
UniCarbKB [29]. These databases contain information 
about glycan structures, or complex carbohydrates, which 
are often covalently linked to proteins forming glycopro- 
teins. The connections between glycomics and proteomics 
databases are required to accurately describe the proper- 
ties and potential biological functions of glycoproteins. In 



order to establish such a connection this working group 
cooperated with UniProt developers present at the Bio- 
Hackathon to agree upon and develop a standard RDF 
representation for carbohydrate structures, along with the 
relevant biological and bibliographic annotations and ex- 
perimental evidence. Data from the individual databases 
have been exported in the newly developed RDF format 
(version 0.1) and stored in a triple store, allowing for 
cross-database queries. Several proof-of-concept queries 
were tested to show that federated queries could be made 
across multiple databases to demonstrate the potential for 
this technology in glycomics research. For example, both 
UniProt and JCGGDB are important databases in their re- 
spective domains of protein sequences and glycomics data. 
Moreover, UniCarbKB is becoming an important glycomics
resource as well. However, since UniCarbKB is not
linked with JCGGDB, a SPARQL query was written to
find the JCGGDB entries for each respective UniCarbKB
entry (Aoki-Kinoshita et al., 2013 [30]). This was made
possible by the integration of UniCarbKB, JCGGDB and
GlycomeDB data, with GlycomeDB serving as the link between the
former two datasets. This would not have been possible without
agreement upon the standardization of the pertinent glycomics
data in each database, discussed at BioHackathons.
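
The linking step can be pictured as a join through a shared hub record. The sketch below is an illustration only: it uses plain Python tuples instead of a triple store, and the predicate `ex:xref` and all identifiers are invented for the example rather than taken from the glycan RDF format or from the query reported in [30].

```python
# Hypothetical cross-reference triples (subject, predicate, object).
TRIPLES = {
    ("glycomedb:100", "ex:xref", "unicarbkb:7"),
    ("glycomedb:100", "ex:xref", "jcggdb:JCGG-STR000001"),
    ("glycomedb:101", "ex:xref", "unicarbkb:8"),
    ("glycomedb:102", "ex:xref", "jcggdb:JCGG-STR000002"),
}


def linked_entries(triples, source_prefix, target_prefix):
    """Pair source and target entries that share the same hub subject."""
    by_hub = {}
    for subject, predicate, obj in triples:
        if predicate == "ex:xref":
            by_hub.setdefault(subject, []).append(obj)
    pairs = set()
    for xrefs in by_hub.values():
        for source in xrefs:
            for target in xrefs:
                if source.startswith(source_prefix) and target.startswith(target_prefix):
                    pairs.add((source, target))
    return sorted(pairs)


if __name__ == "__main__":
    print(linked_entries(TRIPLES, "unicarbkb:", "jcggdb:"))
```

Only the hub record that references both databases produces a pair, which mirrors why the two resources could be connected only once all three datasets were expressed in a common representation.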

Text processing 

The Data Mining and Natural Language Processing (NLP) 
groups focused their efforts in two primary domains: in- 
formation extraction from scientific text - particularly 
from PDF articles - in the form of ontology-grounded tri- 
ples, and the conversion of natural language questions 
into triples and/or SPARQL queries. Both of these were 
pursued with an eye to standardization and interoperabil- 
ity between life science databases. 

Text extraction from PDF and metadata retrieval The
first step in information extraction is ensuring that accurate
plain-text representations of scientific documents are
available. A widely recognized "choke point" that inhibits 
the processing and mining of vast biomedical document 
stores has been the fact that the bulk of information 
within them is often available only as PDF-formatted doc- 
uments. Access to this information is crucial for a variety 
of needs, including accessibility to model organism data- 
base curators and the population of RDF triple stores. In 
confronting this issue, the BioHackers worked on a novel 
software project called PDFX [31,32], which automatically 
converts PDF scientific articles to XML form. The
general use case was to include PDFX as a pre-processing 
step within a wide variety of more involved processing 
pipelines, such as the additional concerns of the Bio- 
Hackathon data mining and NLP groups presented next. 
When text is extracted from PDF documents in this way,
it also becomes necessary to retrieve the relevant metadata.
This was done using DBCLS's TogoDoc [33] literature management and
recommendation system, which detects the Digital Object
Identifier (DOI) or PubMed identifier of each PDF submission
in order to retrieve metadata such as
MeSH terms and make recommendations to users.
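
How identifier detection in extracted text might work can be sketched with two regular expressions. This is an assumption about one plausible approach, not a description of how TogoDoc is implemented; the patterns are deliberately simple, and the DOI and PubMed identifier in the usage example are placeholders.

```python
import re

# A DOI starts with "10.", a registrant code, a slash and a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# PubMed identifiers are matched only when labelled "PMID" in the text.
PMID_PATTERN = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)


def find_identifiers(text):
    """Return the DOIs and PubMed identifiers mentioned in plain text."""
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    dois = [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]
    pmids = PMID_PATTERN.findall(text)
    return {"doi": dois, "pmid": pmids}


if __name__ == "__main__":
    sample = "Available at doi:10.1234/example.2012.001. PMID: 12345678"
    print(find_identifiers(sample))
```

Once an identifier is found, it can be passed to a metadata service to retrieve fields such as MeSH terms.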

Named entity recognition and RDF generation Once 
text is in processable form, the next phase of informa- 
tion extraction is entity recognition within the text. The 
field of gene name extraction suffers from a prevalence 
of diverse annotation schemata, ontologies, definitions 
of semantic classes, and standards regarding where the 
edges of gene names should be marked within a corpus 
(an annotated collection of topic-specific text). In 2011, 
the NLP/text mining group worked on an application 
for combining, viewing and editing the outputs of a var- 
iety of gene-mention-detection systems, with the goal of 
providing RDF outputs of protein/gene annotation tools 
such as GNAT [34], GeneTUKit [35], and BANNER [36]. 
The Annotation Ontology was used to represent these 
metadata. However, at the 2012 event, the SIO ontology 
[37] was extended to enable representation of entity- 
recognition outputs directly in RDF: resources were de- 
scribed in terms of a number of novel relation types (prop- 
erties) and incorporated in an inheritance and partonymy 
hierarchy. Using these various components as a proof of 
concept, the NLP sub-group began developing a generic 
RDFization framework, BioInterchange [38], comprising
three pipelined steps - data deserialization, object model 
generation, and RDF serialization - to enable easy data 
conversion into RDF with automatic ontological mappings 
primarily to SIO and secondarily to other ontologies. 
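
The three pipelined steps can be sketched as three small functions. This is a minimal illustration of the pattern, not BioInterchange code: the input columns, the `Mention` class and the `http://example.org/ner#` vocabulary are hypothetical, whereas the real framework maps primarily to SIO.

```python
import csv
import io
from dataclasses import dataclass

EX = "http://example.org/ner#"  # hypothetical vocabulary, stands in for SIO terms
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass
class Mention:
    document: str
    text: str
    start: int
    end: int


def deserialize(tsv_text):
    """Step 1: read tool output (assumed to be TSV with a header row)."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def build_model(rows):
    """Step 2: turn raw rows into typed objects."""
    return [Mention(r["document"], r["text"], int(r["start"]), int(r["end"])) for r in rows]


def escape(value):
    """Escape backslashes and quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def serialize(mentions):
    """Step 3: write the object model as N-Triples lines."""
    lines = []
    for index, mention in enumerate(mentions):
        node = f"_:mention{index}"
        lines.append(f"{node} <{EX}inDocument> <{mention.document}> .")
        lines.append(f'{node} <{EX}text> "{escape(mention.text)}" .')
        lines.append(f'{node} <{EX}start> "{mention.start}"^^<{XSD_INTEGER}> .')
        lines.append(f'{node} <{EX}end> "{mention.end}"^^<{XSD_INTEGER}> .')
    return lines


if __name__ == "__main__":
    tsv = "document\ttext\tstart\tend\nhttp://example.org/doc/1\tTP53\t10\t14\n"
    print("\n".join(serialize(build_model(deserialize(tsv)))))
```

Keeping the object model separate from both input parsing and output writing means a new input format or a different target ontology touches only one of the three steps.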

Natural language query conversion to SPARQL The
final activity within the NLP theme was the conversion
of natural language queries to SPARQL queries. SPARQL 
queries are a natural interface to RDF triple-store end- 
points, but they remain challenging to construct, even for 
those with intimate knowledge of the target data schema. 
It would be easier, for example, to enable users to ask a 
question such as "What is the sequence length for human 
TP53?" and receive an answer from the UniProt database, 
based on a SPARQL query that the system constructs 
automatically. A pre-existing tool from the DBCLS that
can accomplish natural-language-to-SPARQL conversion
was targeted and customized for the SNOMED-CT [39]
dataset in BioPortal [40]. A large set of natural language
test queries was developed, and for a subset of those
queries the post-conversion output was analyzed and 
compared to a manually created gold standard output; 
subsequently, the group undertook a linguistic analysis of 
what conversions would have to be carried out in order to 
transform the current system output to the gold standard. 



These efforts included using natural language generation 
technology to build a Python solution that generates hun- 
dreds of morphological and syntactic variants of various 
natural language question types. 
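
The idea of generating question variants can be conveyed with simple template expansion. The sketch below is far cruder than natural language generation technology and is not the group's code; the openers, prepositions and function name are assumptions chosen to reproduce the example question used earlier in this section.

```python
from itertools import product

# Each opener is paired with the punctuation mark that suits it.
OPENERS = [("What is", "?"), ("Show me", "."), ("Give me", ".")]
PREPOSITIONS = ["of", "for"]


def question_variants(attribute, entity):
    """Expand one attribute/entity pair into several surface forms."""
    return [
        f"{opener} the {attribute} {preposition} {entity}{mark}"
        for (opener, mark), preposition in product(OPENERS, PREPOSITIONS)
    ]


if __name__ == "__main__":
    for variant in question_variants("sequence length", "human TP53"):
        print(variant)
```

Each added list of alternatives multiplies the number of outputs, which is how a small set of templates can yield hundreds of test questions.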

Ontology 

IRI mapping and normalization 

The first step in any semantic integration activity is to 
agree on the identifiers for various concepts. BioPortal, a 
central repository for biomedical ontologies, allows users 
to download original ontology files in a variety of formats 
(OWL [41], OBO [42], etc.), but also makes these ontol- 
ogies available using RDF through a Web service and 
SPARQL endpoint [43]. In RDF, entities (classes, relations 
and individuals) are identified using an Internationalized 
Resource Identifier (IRI); however, the identifiers that are 
automatically generated by BioPortal do not always match 
with those used in submitted RDF-based ontologies, thereby 
impeding integration across ontologies. Moreover, since 
ontologies are also used to semantically annotate biomed- 
ical data, there is a lack of semantic integration between 
data and ontology. BioHackathon activities included sur- 
veying, mapping, and normalizing the IRIs present in the 
RDF-based ontologies found in the BioPortal SPARQL 
endpoint to a canonical set of IRIs in a custom dataset 
and namespace registry, primarily used by the Bio2RDF 
project [44]. This registry is being integrated with the 
MIRIAM Registry [45] which powers Identifiers.org, 
thereby enabling users to select either the provider IRI 
(if available), the Identifiers.org IRI (if available), or the 
Bio2RDF IRI (for all data and ontologies) [46]. 
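
The choice between the three IRI forms can be sketched as a lookup with a preference order. The Identifiers.org and Bio2RDF IRI patterns below follow the forms those services publish; the two small registries, the function names and the default order are illustrative assumptions, not the contents of the actual namespace registry.

```python
# Hypothetical registry excerpts; a real registry would hold many more namespaces.
PROVIDER_PATTERNS = {"uniprot": "http://purl.uniprot.org/uniprot/{id}"}
IDENTIFIERS_ORG_NAMESPACES = {"uniprot", "taxonomy"}


def candidate_iris(namespace, local_id):
    """Return every IRI form available for one namespace/identifier pair."""
    iris = {"bio2rdf": f"http://bio2rdf.org/{namespace}:{local_id}"}  # always available
    if namespace in IDENTIFIERS_ORG_NAMESPACES:
        iris["identifiers.org"] = f"http://identifiers.org/{namespace}/{local_id}"
    if namespace in PROVIDER_PATTERNS:
        iris["provider"] = PROVIDER_PATTERNS[namespace].format(id=local_id)
    return iris


def preferred_iri(namespace, local_id, order=("provider", "identifiers.org", "bio2rdf")):
    """Pick the first available IRI form according to the caller's preference."""
    iris = candidate_iris(namespace, local_id)
    for scheme in order:
        if scheme in iris:
            return iris[scheme]
    raise KeyError(f"no IRI available for {namespace}:{local_id}")


if __name__ == "__main__":
    print(preferred_iri("uniprot", "P04637"))
    print(preferred_iri("taxonomy", "9606"))
    print(preferred_iri("mydb", "42"))  # unregistered namespace falls back to Bio2RDF
```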

Environmental ontologies for metagenomics 

In the domain of metagenomics, establishing a semantically 
controlled description of a sample's original environment is 
essential for reliably archiving and retrieving relevant data- 
sets. The BioHackathon resulted in a strategy for the re- 
engineering of the Metagenome Environment Ontology 
(MEO) [47], closely linked to the MicrobeDB project [48], 
to serve as a community-specific portal to resources such as
the Environment Ontology (EnvO) [49]. In this role, MEO
will deliver curated, high-value subsets of such resources 
to the (meta)genomics community for use in efficient, se- 
mantically controlled annotation of sample environments. 
Additionally, MEO will enrich and shape the ontologies 
and vocabularies it references through persistently con- 
solidating and submitting feedback from its users. 

An ontology for lexical resources 

The Life Science Dictionary (LSD) [50] consists of various 
lexical resources including English-Japanese/Japanese- 
English dictionaries with >230,000 terms, a thesaurus using 
the MeSH vocabulary [51,52], and co-occurrence data that
show how often a pair of terms appears in a MEDLINE [53]
entry. LSD has been edited and maintained by the LSD
project since 1993 and provides a search service on the 
Web, as well as a downloadable version. To assist with 
machine-readability of this important lexical resource, the 
group developed an ontology for this dataset [54], and an 
RDF serialization of the LSD was designed and coded at 
the BioHackathon. As a result, a total of 5,600,000 triples 
were generated and made available at the SPARQL end- 
point [55]. 

An ontology for incomplete enzyme reaction equations 

Incomplete enzyme reactions are not of interest to the
International Union of Biochemistry and Molecular Biology
(IUBMB; which manages EC numbers) [56], but are common
in metabolomics. Enzymes and reactions are described
in Gene Ontology (GO) [57] and Enzyme
Mechanism Ontology (EMO) [58], but these simply follow
the IUBMB classification. It would be helpful to establish
a structured representation that describes the available
knowledge about a reaction of interest even if
the equation is not complete. A semantic representation
of incomplete enzyme reaction equations was therefore designed
based on ontological principles. About 6,800 complete
reaction equations taken from the KEGG [59,60] database
were decomposed into 13,733 incomplete reactions, from
which 2,748 chemical transformation patterns were obtained.
They were classified into a semantic data structure
consisting of about 1,100 terms (functional groups, substructures,
and reaction types) commonly used in organic
chemistry and biochemistry. We continue to curate the ontology
for incomplete enzyme reaction equations, aiming at
its use in metabolome and other omics-level research
(available at GenomeNet [61]).

Metadata 

Metadata activities at the BioHackathon could be grouped 
into three areas of focus: service quality indicators, data- 
base content descriptors, and a broader inclusive discus- 
sion of generic metadata that could be used to characterize 
datasets in a database catalogue for enhanced data discov- 
ery, assessment, and access (not limited to but still useful 
for biodatabases). 

Service quality indicators 

With respect to data quality, the BioHackers coined the 
phrase "Yummy Data" as a shorthand way of expressing 
not only data quality, but more importantly, the ability 
to explicitly determine the quality of a given dataset. 
While quality of the published data is an important issue, 
it is a domain that depends as much on the underlying 
biological experiments as the code that analyses them. As 
such, the data quality working group at the BioHackathon 
focused on the issue of testing the quality of the published 
data endpoint, with respect to endpoint availability and 



other metrics. Therefore, the Yummy Data project [62]
was initiated; it periodically inspects the availability, response
time, content amount and a few quality metrics for
a selection of SPARQL endpoints of interest to biomedical 
investigators. While neither defining, nor executing, an 
exhaustive set of useful quality-measurements, it is hoped 
that this software may act as a starting point that encour- 
ages others to measure the "yumminess" of the data they 
provide, and thereby improve the quality of the published 
semantic resources for the global community. 
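
A single availability and response-time measurement of the kind described can be sketched as follows. This is not the Yummy Data implementation: the trivial ASK query, the function names and the injectable `fetch` parameter are assumptions made so that the probe can be exercised without network access.

```python
import time
import urllib.parse
import urllib.request

ASK_QUERY = "ASK { ?s ?p ?o }"  # cheap query that any non-empty endpoint can answer


def http_fetch(url, timeout=10):
    """Issue a GET request and return (status code, body)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


def probe_endpoint(endpoint, fetch=http_fetch, clock=time.monotonic):
    """Measure whether an endpoint answers and how long it takes."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": ASK_QUERY})
    started = clock()
    try:
        status, _body = fetch(url)
        available = 200 <= status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        available = False
    return {"endpoint": endpoint, "available": available, "seconds": clock() - started}


if __name__ == "__main__":
    # A stand-in fetcher keeps the demonstration offline.
    print(probe_endpoint("http://example.org/sparql", fetch=lambda url: (200, b"{}")))
```

Running such a probe on a schedule and storing the results over time would give the periodic availability and response-time summary that the project aims to present.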

Database content descriptors 

The BioDBCore project [63,64] has created a community- 
defined, uniform, generic description of the core attributes 
of biological databases that will allow potential users of 
that database to determine its suitability for their task at 
hand (e.g. taxonomic range, update frequency, etc.). The 
proposed BioDBCore core descriptors are overseen by the 
International Society for Biocuration (ISB) [65], in collab- 
oration with the BioSharing initiative [66]. One of the key 
activities of BioDBCore discussion at the BioHackathon 
was to define the RDF Schema and relevant annotation 
vocabularies and ontologies capable of representing the 
nature of biological data resources. As mentioned above, 
RDF representations necessitate the choice of a stable URI 
for each resource. The persistent identifiers considered for 
biological databases included NAR database collection 
[67,68], DBpedia [69,70], Identifiers.org and ORCID [71], 
while vocabularies from Biositemaps [72], EMBRACE 
Data and Methods (EDAM) [73], Biomedical Resource 
Ontology (BRO) [74] and The Ontology for Biomedical 
Investigations (OBI) [75] were evaluated to describe fea- 
tures such as resource and data types, and area-of- 
research. The exploration involved several specific use 
cases, including METI Life science integrated database 
portal (MEDALS) [76] and NBDC/DBCLS [77]. Another 
key activity at the hackathons was focused on the 
BioDBCore Web interface [78], both for submission and 
retrieval. Open issues include how to specify the useful
interconnectivity between databases, for example, in
planning cross-resource queries, and how to describe the
content of biological resources in a machine-readable way
so that it can easily be queried by SPARQL regardless of which
vocabularies a given resource uses. Currently, the group
is considering the idea of using the named graph of a re- 
source to store these kinds of metadata. There was also 
inter-group discussion of how to integrate BioDBCore 
with other projects such as DRCAT [79], which defines a 
similar, overlapping set of biological resources and their 
features. 

Generic metadata for dataset description 

The generic metadata discussion started by defining the 
problem of making database catalogue metadata machine-readable,
so that a given dataset is automatically discoverable
and accessible by machine agents using SPARQL. We
discussed a set of conventions to describe the nature and 
availability of datasets on the emerging life science Seman- 
tic Web. In addition to basic descriptions, we focused our 
effort on elements of origin, licensing, (re-)distribution, 
update frequency, data formats and availability, language, 
vocabulary and content summaries. We expect that adher- 
ence to a small number of simple conventions will not 
only facilitate discovery of independently generated and 
published data, but also create the basis for the emergence 
of a data marketplace, a competitive environment to offer 
redundant access to ever higher quality data. These dis- 
cussions have continued in teleconferences hosted by the 
W3C Health Care and Life Sciences Interest Group 
(HCLSIG) [80], and included at various times stakeholders 
such as DBCLS, MEDALS, BioDBCore, Biological Linked 
Open Data (BioLOD) [81], Biositemaps, UniProt, Bio2RDF, 
Biogateway [82], Open PHACTS [83], EURECA [84] and 
Identifiers.org. 
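
A machine-readable dataset description covering some of these elements might look like the output of the sketch below. The choice of the Dublin Core Terms and VoID vocabularies is an assumption made for illustration, since the conventions were still under discussion; the dataset IRI, licence, endpoint, frequency IRI and triple count in the usage example are placeholders.

```python
def describe_dataset(iri, title, license_iri, frequency_iri, endpoint, triples):
    """Return a Turtle description of one dataset (illustrative vocabulary choice)."""
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{iri}> a void:Dataset ;",
        f'    dcterms:title "{safe_title}"@en ;',                 # basic description
        f"    dcterms:license <{license_iri}> ;",                 # licensing
        f"    dcterms:accrualPeriodicity <{frequency_iri}> ;",    # update frequency
        f"    void:sparqlEndpoint <{endpoint}> ;",                # availability
        f'    void:triples "{int(triples)}"^^xsd:integer .',      # content summary
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        describe_dataset(
            iri="http://example.org/dataset/demo",
            title="Demo dataset",
            license_iri="http://creativecommons.org/licenses/by/4.0/",
            frequency_iri="http://purl.org/cld/freq/monthly",
            endpoint="http://example.org/sparql",
            triples=1000,
        )
    )
```

Because every element is an ordinary triple, a catalogue of such descriptions can itself be queried with SPARQL to discover datasets by licence, update frequency or size.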

Platforms 
RDFization tools 

Generation of RDF data often requires iterative trials. In 
an early stage of prototyping RDF data, it is recom- 
mended to use OpenRefine [85] (formerly known as 
Google Refine) with the RDF extension [86] for correct- 
ing fluctuations of data, generation of URIs from ID lit- 
erals and eventually converting tabular data into RDF. 
To automate the procedure, various hackathon initia- 
tives generated RDFization tools and libraries, particu- 
larly for the Bio* projects. A generic tool, bio-table [87], 
can be used for converting tabular data into RDF, using 
powerful filters and overrides. This command-line tool
is freely available as a biogem package and was expanded
during the BioHackathon to include support for named
columns. Another Ruby biogem binary and library,
bio-rdf [88], utilizes bio-table and generates RDF data
from the results of genomic analyses, including gene
enrichment, QTL and other protocols implemented in
R/Bioconductor. BioInterchange was conceived and
designed during BioHackathon 2012 as a tool, web services
and libraries for Ruby, Python and Java that create RDF
triples from files in TSV, XML, GFF3, GVF and other
formats, including text mining results. Users can specify
external ontologies for the conversion, and the project
also developed the biomedical ontologies needed for
GFF3 and GVF data [89]. ONTO-PERL
[90], a tool to handle ontologies represented in the OBO 
format, was extended to allow conversion of Gene Ontol- 
ogy (GO) annotations as RDF (GOA2RDF). Moreover, 
given that most legacy data resources have a correspond- 
ing XML schema, some effort was put into exploring and 
coding automated Schema-to-RDF translation tools for
many of the widely used bioinformatics data formats such
as BioXSD [91]. After working with the EDAM developers 
at the BioHackathon to modify their URI format to fit 
more naturally with an RDF representation, the EDAM 
ontology was successfully used to annotate the relevant 
portions of an automated BioXSD transformation, sug- 
gesting that significantly greater interoperability between 
bioinformatics resources should soon be enabled. 
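
The general idea behind these tabular converters, minting a URI from each ID literal and emitting one triple per remaining column, can be sketched in a few lines of Python. This is not the bio-table, bio-rdf or BioInterchange implementation; the base URIs, column names and sample rows are hypothetical.

```python
import csv
import io

BASE = "http://example.org/gene/"      # hypothetical URI prefix for subjects
VOCAB = "http://example.org/vocab#"    # hypothetical predicate namespace


def table_to_ntriples(handle, id_column, delimiter="\t"):
    """Yield one N-Triples line per non-empty cell, keyed by the ID column."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Mint the subject URI from the ID literal.
        subject = "<" + BASE + row[id_column].strip() + ">"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the ID itself and empty cells
            predicate = "<" + VOCAB + column.strip().replace(" ", "_") + ">"
            obj = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
            yield subject + " " + predicate + " " + obj + " ."


sample = "gene_id\tsymbol\tchromosome\nG1\trecA\tchr\nG2\tlexA\t\n"
triples = list(table_to_ntriples(io.StringIO(sample), "gene_id"))
print("\n".join(triples))
```

Every value is emitted as a plain literal here; a real converter would also map selected columns to URIs or typed literals and reuse predicates from existing ontologies.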

Triple stores 

Moving from individual endpoints to multi-resource inte- 
gration, the BioHackathon working group on triplestores 
also explored the problem of deploying multiple, inter- 
dependent and distributed triplestores, as well as search- 
ing over these, which included the examination of cluster- 
based triplestores, Hadoop-based triple stores [92-94], and 
emergent federated search systems. The group determined 
that Hadoop-based stores were not mature enough for
production use because they work with only limited
types of data and lack functionality such as a SPARQL
endpoint and a user interface. Regarding
cluster-based triplestores, the group found that there was
insufficient installation documentation, so these
could not be tested adequately. Federated search using
SPARQL 1.1 [95] could only be tested on OWLIM [96] at 
the time, and it was found that queries could not work ef- 
ficiently across multiple endpoints. Thus, while single- 
source semantic publication seems to be well supported, 
the technologies backing distributed semantic datasets - 
both from the publisher's and the consumer's perspective -
are lacking at this time. 
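
For readers unfamiliar with SPARQL 1.1 federation, the sketch below composes a federated query in which each remote endpoint is addressed through a SERVICE clause. It only builds the query text; the endpoint URLs, predicates and graph patterns are placeholders, and nothing here reflects the specific queries the group ran on OWLIM.

```python
def federated_query(select_vars, service_patterns):
    """Build a SPARQL 1.1 query with one SERVICE block per endpoint.

    service_patterns maps an endpoint URL to the graph pattern that
    should be evaluated at that endpoint.
    """
    blocks = []
    for endpoint, pattern in service_patterns.items():
        blocks.append("  SERVICE <%s> {\n    %s\n  }" % (endpoint, pattern))
    head = "SELECT " + " ".join("?" + v for v in select_vars)
    return head + "\nWHERE {\n" + "\n".join(blocks) + "\n}"


query = federated_query(
    ["protein", "pathway"],
    {
        # Hypothetical endpoints and predicates, for illustration only.
        "http://example.org/proteins/sparql":
            "?protein <http://example.org/vocab#encodedBy> ?gene .",
        "http://example.org/pathways/sparql":
            "?pathway <http://example.org/vocab#hasGene> ?gene .",
    },
)
print(query)
```

Because the two patterns share the variable ?gene, the engine that receives the query has to join results coming from both endpoints, which illustrates why efficiency across multiple endpoints is a concern.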

Applications 

Semantic Web exploration and visualization 

The Semantic Web simplifies the integration of heteroge- 
neous information without the need for a pre-coordinated 
comprehensive schema. As a trade-off, querying Semantic 
Web resources poses particular challenges: how can a re- 
searcher understand what is in a knowledge base, and 
how can he or she understand its information structure 
enough to make effective queries? Interactive exploration 
and visualization tools offer intuitive approaches to infor- 
mation discovery and can help applied researchers to ef- 
fectively make use of Semantic Web resources. In the 
previous edition of the BioHackathon, a working group fo- 
cused on the development of prototypes to visualize RDF 
knowledge bases. As Semantic Web and Linked Data re- 
sources are becoming more available, in the life sciences 
and beyond, several new tools (interactive or not) for 
visualization of these kinds of resources have been proposed.
The 2011 edition of the BioHackathon produced a review of
such tools, with a view to their applicability in the
biomedical domain. Through inspections and surveys, we
gathered basic information on more than 30 currently
available tools. In particular, we gathered information on:

Requirements and availability The operating systems
supported, hardware requirements, licensing and costs.
Because these tools are intended for an applied biomedical
domain, we also considered the availability of simplified
installation procedures.

Features The type of data access supported (e.g., via
SPARQL endpoint or file-based), the type of query
formulation supported (creation of graphical patterns,
text-based queries, Boolean queries), and whether reasoning
services are provided or exploited. Finally, when possible,
we recorded some indication of the type of user
interaction offered (e.g., browsing versus link discovery).

Assistance and support Whenever possible, we
collected information on the availability of community-based
or commercial support, the availability of documentation,
the frequency of software updates and the
availability of user groups and mailing lists, for which we
sketched approximate activity metrics.

Technical aspects Whether the surveyed tools can be
embedded in other systems or provide a plugin
architecture; where relevant, the language in which they are
developed; and finally which standards they support (e.g.,
VoID [97], SPARQL 1.1).

Specificity to life sciences use cases Finally, we have 
tried to collect information highlighting the usability of 
these tools in life sciences research (e.g., life sciences 
bundled datasets, relevant demo cases, citations per re- 
search area). 

This collection of information is useful for deciding which
tools are potentially usable given technical, expertise or
reliability constraints. Following this data collection
exercise, we have started to devise a classification of
tools by identifying some defining key characteristics. For
instance, a key characteristic of the surveyed tools is their
approach to data: some focus more on instance data and
tend to provide a graph-like metaphor, whereas others focus
more on classes and relations and tend to present class-based
access. Another key aspect is the degree to which
visualization tools aim to support data exploration rather
than explanation. Based on our classification, we aim to
choose a few representative tools, benchmark them and
evaluate how effective different types of tools are in
simple tasks.

Ontology mapping visualization 

Ontology mapping deals with relating concepts from 
different ontologies and is typically concerned with the 
representation and storage of mappings between the
concepts [98]. BioPortal ontologies [40] are usually
interconnected, and mappings between them are available,
although a visualization of these mappings is not cur- 
rently available. Two types of mapping visualizations 
were explored at the BioHackathon: (1) A visualization 
of ontology mappings of all BioPortal ontologies, and 
(2) A visualization of a subset of BioPortal ontologies 
that would be useful in OntoFinder/Factory [99] - a tool 
for finding relevant BioPortal ontologies and also building 
new ontologies. The hackers investigated the applicability 
and utility of two tools/environments: Google Fusion 
Tables [100], and Gephi [101]. This work is ongoing. 
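
One plausible preparatory step for either tool is to collapse concept-level mappings into a weighted ontology-to-ontology edge list. The sketch below does this in Python and writes a CSV with Source, Target and Weight columns, the column names commonly used for edge tables imported into Gephi. The mapping pairs are invented, the 'ONTOLOGY:local_id' notation is an assumption, and this is not the pipeline the participants used.

```python
import csv
import io
from collections import Counter


def ontology_edges(concept_mappings):
    """Count mappings per unordered pair of ontologies.

    concept_mappings: iterable of (source_concept, target_concept) strings
    written as 'ONTOLOGY:local_id' (an assumed notation).
    """
    counts = Counter()
    for source, target in concept_mappings:
        a = source.split(":", 1)[0]
        b = target.split(":", 1)[0]
        if a != b:  # ignore mappings within a single ontology
            counts[tuple(sorted((a, b)))] += 1
    return counts


def write_edge_csv(counts, handle):
    """Write the ontology pairs as a weighted edge list."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(counts.items()):
        writer.writerow([a, b, weight])


# Invented concept identifiers, for illustration only.
mappings = [
    ("GO:0001", "MESH:D01"),
    ("MESH:D02", "GO:0002"),
    ("GO:0003", "DOID:7"),
]
edges = ontology_edges(mappings)
buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```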

Identifier conversion service 

The existence of multiple synonyms for the same data 
(sets) often inhibits cross-resource querying and data mining.
Thus, a centralized server containing curated links between
and among life-science databases would greatly
facilitate data integration tasks in bioinformatics. The
members of the G-language [102] group began developing 
an identifier conversion Web service named G-Links. 
Based on the cross referencing information available from 
UniProt and KEGG, this RESTful service retrieves all iden- 
tifiers and their corresponding PURLs related to an identi- 
fier provided by the user. In addition, users may supply 
nucleotide or amino acid sequences in place of the identi- 
fier, for rapid annotation of sequences. In order to comply 
with the recent Semantic Web and Linked Data initiatives, 
results can be returned in N-triples or RDF/XML formats 
for interoperability, as well as the legacy GenBank, EMBL 
and tabular formats (Table 2). This service is freely avail- 
able at http://link.g-language.org/. 
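
The URL patterns listed in Table 2 are regular enough to compose programmatically. The helper below is a hypothetical convenience function, not part of G-Links; it assumes only the patterns shown in the table (an identifier alone for tabular output, or an identifier followed by /format=nt or /format=rdf) and performs no network access.

```python
BASE_URL = "http://link.g-language.org/"

# Output formats that appear in Table 2; no other values are assumed to exist.
KNOWN_FORMATS = {"nt": "N-Triples", "rdf": "RDF/XML"}


def g_links_url(identifier, output_format=None):
    """Return the G-Links REST URL for an identifier such as 'hsa:126'."""
    if not identifier or "/" in identifier or " " in identifier:
        raise ValueError("identifier must be a single token without '/'")
    if output_format is None:
        return BASE_URL + identifier  # default: tabular output
    if output_format not in KNOWN_FORMATS:
        raise ValueError("unknown format: %r" % output_format)
    return BASE_URL + identifier + "/format=" + output_format


for ident, fmt in [("GeneID:947170", None), ("P0A7G6", "nt"), ("hsa:126", "rdf")]:
    print(g_links_url(ident, fmt))
```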

Table 2 Example queries using G-Links

Query                                  REST API
GeneID:947170 by tabular format        http://link.g-language.org/GeneID:947170
P0A7G6 (UniProt) by N-Triple format    http://link.g-language.org/P0A7G6/format=nt
hsa:126 (KEGG) by RDF format           http://link.g-language.org/hsa:126/format=rdf
POST sequence directly                 https://gist.github.com/1172846

One of the central advantages of Linked Data for an
end-user biologist is the ease of discovery and retrieval of
related information. On the other hand, biological data is
highly inter-related, and the multitude of linkages can easily
become overwhelming, resulting in the familiar "hairballs"
frequently seen in protein-interaction networks. Sophisticated
filtering of Linked Data result sets, ranking the results
according to relevance to one's interests or by some form of
enrichment of interesting phenomena, would assist greatly
in interpreting the content of semantic data stores. Such
filtering, or data arrangement and presentation, should
ideally be accompanied by an intuitive visualization.
Participants pursued these goals by first generating a
complete genome (gene set) of Escherichia coli as Linked
Data using G-Links, together with several associated
numerical datasets calculated through the G-language REST
Web service [103] (a product of BioHackathon 2009).
Statistics such as Cramér's V for nominal data and Spearman's
rank correlation for continuous data were applied to data
coming from multiple, overlapping sources (e.g. KEGG
versus Reactome [104] versus BioCyc [105] for pathways)
to cluster result sets according to their similarity. This
would allow, for example, a user to choose the least-
redundant subset of results in order to maximize the
amount of unique information passed to a visualization
tool. Conversely, these metrics can be used to
screen for enrichment, where over-representation of the
same dataset is considered meaningful, and therefore
that dataset should be highlighted. An example of both
types of filtering was created by the participants using
the JavaScript InfoVis Toolkit [106]. The resulting graph
is highly interactive: clicking any node that represents a
data set re-centers the layout on that data set, with
animation. Demonstrations using
pre-calculated E. coli data are available [107,108].
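
For concreteness, both statistics can be written in dependency-free Python. The sketch below computes Cramér's V from paired nominal labels and Spearman's rank correlation (using average ranks for ties) from paired numerical values. The toy inputs are invented, and the code is not taken from the participants' implementation.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two equally long sequences of nominal labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data: two sources that assign pathway labels identically, and two
# numerical datasets with the same ordering.
print(cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```

Pairs of datasets with values near 1 are largely redundant with one another, whereas pairs with values near 0 contribute more unique information.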

Natural language semantic query via voice recognition 

Finally, the project that generated the most "buzz" among 
the participants in BioHackathon 2012 was Genie - a "Siri 
[109] for Biologists". The G-language Project members 
undertook the development of a virtual research assistant 
for bioinformatics, designed to be an intuitive entry-level 
gateway for database searches. The prototype developed 
and demonstrated at the BioHackathon was limited to 
gene- and genome-centric questions. Users communicate 
with Genie using spoken English, and Genie replies in a 
synthesized voice. Genie can find information in three
main categories: (1) anything about a gene of interest, such
as its sequence, function, cellular localization, pathway,
related diseases, related SNPs and polymorphisms,
interactions, regulation and expression levels; (2) anything
about a set of genes, based on multiple criteria, for
example, all SNPs in genes that are related to cancer, that
work as transferases, that are expressed in the cytoplasm,
and that have orthologs in mice; (3) anything about a
genome, such as production of different types of visual maps,
calculation of GC skews, prediction of replication origins
and termini, calculation of codon usage bias,
and so on. Genie uses an NLP and dictionary-based approach:
with the species name as a top-level filter to reduce the
search/retrieval space, annotations are fetched for this
species, and a dictionary of gene names is created
dynamically. In order to implement integrated information
retrieval, the following software systems were used:



• The G-language Genome Analysis Environment and
its REST service, which allow extremely rapid
genome-centric information retrieval.

• G-language Maps (Genome Projector and Pathway
Projector, as well as the Chaos Game Representation
REST Service), which visualize that genomic
information.

• The Keio Bioinformatics Web Services EMBASSY
package and EMBOSS [110], which provide more
than 400 tools that can be applied to the
information.

• G-Links - an extremely rapid gene-centric data
aggregator.
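
The dictionary-based step described above can be illustrated with a small Python sketch: the species name is matched first, a gene-name dictionary is then built only from that species' annotations, and the remaining words of the question are looked up in it. The annotation table, identifiers and function names are hypothetical stand-ins for data that Genie fetches dynamically, and the sketch omits speech recognition and any real NLP.

```python
# Hypothetical annotations: species -> {gene name: identifier}.
ANNOTATIONS = {
    "escherichia coli": {"recA": "gene:ec-0001", "lexA": "gene:ec-0002"},
    "homo sapiens": {"TP53": "gene:hs-0001", "BRCA1": "gene:hs-0002"},
}


def find_species(question):
    """Return the first known species name mentioned in the question."""
    lowered = question.lower()
    for species in ANNOTATIONS:
        if species in lowered:
            return species
    return None


def build_gene_dictionary(species):
    """Build a case-insensitive gene-name lookup for one species only."""
    return {name.lower(): gene_id
            for name, gene_id in ANNOTATIONS[species].items()}


def find_genes(question):
    """Return (species, [(word, gene_id), ...]) for a plain-English question."""
    species = find_species(question)
    if species is None:
        return None, []
    # The species acts as the top-level filter: only its genes are searched.
    dictionary = build_gene_dictionary(species)
    words = [w.strip("?.,!;:()") for w in question.split()]
    hits = [(w, dictionary[w.lower()]) for w in words if w.lower() in dictionary]
    return species, hits


print(find_genes("What is the function of recA in Escherichia coli?"))
```

Restricting the dictionary to a single species keeps the lookup small and avoids matching gene names that belong to other organisms.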

The Genie prototype is accessible online [111,112].

Conclusions

The BioHackathon series started out with the Integrated
Database Project of Japan, aiming to integrate all life sci- 
ence databases in Japan. Initially, the focus was on Web 
services and workflows to enable efficient data retrieval. 
However, the focus eventually shifted towards Semantic 
Web technologies due to the increasing heterogeneity 
and interlinked nature of the data at hand, for example, 
from the accumulation of next-generation sequencing 
data and their annotations. From this, the community 
recognized the importance of RDF and ontology devel- 
opment - fundamental Semantic Web technologies that 
have also come to gain the attention of other domains in 
the life sciences, including genome science, glycosciences 
and protein science. For example, BioMart and InterMine, 
which were initially developed to aid the integration of life 
science data, have now started to support Semantic Web
technologies. These hackathons have served as a driving 
force towards integration of data "islands" that have slowly 
started linking to one another through RDF development. 
However, insufficient guidelines, ontologies and tools to
support RDF development have hampered true integration.
The development of such guidelines, ontologies and tools 
has been the central focus of these hackathons, bringing 
together the community on a consistent basis, and these
efforts have finally started to bud. We
expect them to bear fruit in the near future through the
development of biomedical and metagenome applications on
top of these developments. Moreover, we expect that text mining
will become increasingly vital to enriching life science Se- 
mantic Web data with the knowledge currently hidden 
within the literature. 